
Record: 11L Partial RoPE + LN Scale + EMA + XSA4 (val_bpb: 1.1248)#315

Merged
cocohearts merged 1 commit into openai:main from jfprincz:submission/11l-partialrope-lateqat-1.1248 on Mar 23, 2026

Conversation

@jfprincz (Contributor) commented Mar 21, 2026

Record: 11L Partial RoPE + LN Scale + EMA + XSA4 (val_bpb: 1.1248)

val_bpb: 1.1248 (sliding window, stride=64) | 15.6 MB | 8xH100 SXM, 600s

Progress from prior submissions

| | PR #70 | PR #164 | PR #198 | PR #287 | This PR | Delta vs #287 |
|---|---|---|---|---|---|---|
| val_bpb (sliding) | 1.1659 (s256) | 1.1524 (s256) | 1.1318 (s64) | 1.1271 (s64) | 1.1248 (s64) | -0.0023 |
| Layers | 9 | 9 | 11 | 11 | 11 | |
| Params | 21.8M | 22.4M | 26.8M | 26.8M | 26.8M | |
| Artifact | 14.9 MB | 15.4 MB | 15.7 MB | 15.5 MB | 15.6 MB | +0.1 MB |

Two new techniques on top of PR #287's 11-layer stack.

Key additions over PR #287

| Change | Impact |
|---|---|
| Partial RoPE (16 of 64 dims) | Apply rotary embeddings to only 25% of head dimensions; the remaining dims use position-free attention, improving generalization. Zero new parameters. |
| LN Scale | RMSNorm outputs scaled by 1/sqrt(layer_idx+1), damping deeper layers' contributions and stabilizing training. Zero new parameters. |
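Both additions can be sketched in plain Python (a minimal illustration, not the submitted `train_gpt.py`; the adjacent-pair rotation convention and function names are assumptions):

```python
import math

def partial_rope(q, rope_dims=16, pos=0, base=10000.0):
    """Apply rotary embedding to only the first `rope_dims` entries of a
    64-dim head vector; the remaining dims pass through position-free."""
    out = list(q)
    for i in range(rope_dims // 2):
        theta = pos / (base ** (2 * i / rope_dims))
        c, s = math.cos(theta), math.sin(theta)
        x, y = q[2 * i], q[2 * i + 1]
        out[2 * i] = x * c - y * s       # rotate each (x, y) pair by theta
        out[2 * i + 1] = x * s + y * c
    return out

def ln_scale(normed, layer_idx):
    """Scale an RMSNorm output by 1/sqrt(layer_idx + 1), damping deeper layers."""
    s = 1.0 / math.sqrt(layer_idx + 1)
    return [v * s for v in normed]
```

At position 0 the rotation is the identity, and dims 16..63 are untouched at any position, which is what makes the change parameter-free.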

Everything else from PR #287 carries forward: 11 layers, XSA on last 4 layers, EMA (0.997), OrthoInit + muP, 3x MLP, int6 mixed quant + zstd-22, WD=0.04, SmearGate, BigramHash(2048), FA3, seq 2048, tuned Muon.
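The carried-forward EMA is a simple lerp of shadow weights toward the live weights; a pure-Python stand-in for the fused `torch._foreach_lerp_` call (update cadence and list layout are assumptions):

```python
def ema_update(shadow, params, decay=0.997):
    """In place: shadow <- decay * shadow + (1 - decay) * params.
    Equivalent to torch._foreach_lerp_(shadow, params, 1 - decay)."""
    for i, p in enumerate(params):
        shadow[i] = decay * shadow[i] + (1.0 - decay) * p
    return shadow
```

At decay 0.997 the shadow effectively averages over roughly 1/(1-0.997) ≈ 333 recent updates; evaluation and quantization would then presumably run on the shadow weights rather than the live ones.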

Results

| Metric | Value |
|---|---|
| Pre-quant val_bpb | 1.1418 |
| Int6 roundtrip val_bpb | 1.1485 |
| Int6 sliding val_bpb (s64) | 1.1248 |
| Steps completed (600s cap) | 7,051 |
| Step time | 85 ms |
| Model params | 26,829,913 |
| Artifact size | 15,612,308 bytes |
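The sliding-window evaluation (stride 64) amounts to scoring each token once with up to a full context window of left context; a hypothetical reconstruction of the span bookkeeping, not the repo's eval code (`window` default and scoring convention are assumptions):

```python
def sliding_eval_positions(n_tokens, window=2048, stride=64):
    """Yield (context_start, score_start, score_end) spans so that every
    token is scored exactly once, each with up to `window` tokens of
    left context; smaller stride = more context per scored token."""
    pos = 0
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        ctx_start = max(0, end - window)  # pad context leftward to the window
        yield ctx_start, pos, end          # score only [pos, end)
        pos = end
```

This is why stride 64 (s64) yields a lower bpb than stride 256 (s256) in the progress table: the same model sees more context per scored token, at the cost of more forward passes.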

Reproducibility (3 seeds)

| Seed | Steps | Sliding s64 | Artifact (bytes) |
|---|---|---|---|
| 2025 | 7,051 | 1.1248 | 15,612,308 |
| 42 | 7,061 | 1.1250 | 15,528,666 |
| 1337 | 7,063 | 1.1253 | 15,639,340 |

Mean: 1.1250 | Range: 0.0005 | Submitted: seed 2025

Run command

```shell
NUM_LAYERS=11 BIGRAM_VOCAB_SIZE=2048 XSA_LAST_N=4 \
EMA_ENABLED=1 EMA_DECAY=0.997 SWA_ENABLED=0 \
ROPE_DIMS=16 LN_SCALE=1 LATE_QAT=1 QAT_THRESHOLD=0.1 \
MUON_WD=0.04 ADAM_WD=0.04 \
MATRIX_LR=0.025 SCALAR_LR=0.025 TIED_EMBED_LR=0.035 \
MUON_MOMENTUM=0.99 MUON_MOMENTUM_WARMUP_START=0.92 \
MUON_MOMENTUM_WARMUP_STEPS=1500 WARMDOWN_ITERS=3000 \
ITERATIONS=9000 MAX_WALLCLOCK_SECONDS=600 EVAL_STRIDE=64 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

Note on Late QAT

The submitted code includes a Late QAT flag (LATE_QAT=1) intended to enable STE int6 fake-quantization in the final 4% of training. Post-submission analysis (credit: @152334H) revealed that torch.compile constant-folds the CastedLinear._qat_enabled class attribute at first trace, so the STE branch is dead-code-eliminated and never activates during training. Late QAT had no effect on the results. The score is driven entirely by Partial RoPE and LN Scale.
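The failure mode can be reproduced in miniature without torch: a "compiler" that reads the flag once at trace time bakes the False branch into the compiled function, so flipping the class attribute later changes nothing (toy analogy; `trace` and `fake_quant` are illustrative stand-ins):

```python
class CastedLinear:
    _qat_enabled = False  # class attribute, read once at trace time

def fake_quant(x):
    # stand-in for STE int6 fake-quantization
    return round(x * 32) / 32

def trace(layer_cls):
    """Toy compiler: specializes on the flag's value at first trace,
    mimicking torch.compile constant-folding the class attribute."""
    if layer_cls._qat_enabled:           # folded to a constant here
        return lambda x: fake_quant(x)
    return lambda x: x                   # QAT branch dead-code-eliminated

compiled_forward = trace(CastedLinear)   # traced while the flag is False
CastedLinear._qat_enabled = True         # late-QAT flip arrives too late
```

After the flip, `compiled_forward` still returns its input unquantized, exactly as the post-submission analysis found for the real training run.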

@himanalot

yes! great job this is sort of where i went too

bopmite added a commit to bopmite/parameter-golf that referenced this pull request Mar 21, 2026
saml212 added a commit to saml212/parameter-golf that referenced this pull request Mar 21, 2026
robinojw pushed a commit to robinojw/parameter-golf that referenced this pull request Mar 21, 2026
- Add FA3 > FA2 > SDPA attention backend dispatch
- FA2 wrapper uses @torch.compiler.disable + fullgraph=False
- FA3 uses fullgraph=True (compatible with torch.compile)
- Default FP16_KEEP_NAME_PATTERNS empty (quantize everything, matches PR openai#315)
- Add pod_setup.sh with FA3/FA2 install flow
- Add build_fa3_wheel.sh for pre-building FA3 on cheap 1xH100
filipviz added a commit to filipviz/parameter-golf that referenced this pull request Mar 21, 2026
Rename folder to today's date. Replace train_gpt.py with the new
baseline from PR openai#315 (11L XSA4 + EMA + Partial RoPE + Late QAT,
1.1248 BPB). Previous script preserved as previous_train_gpt.py.
Update README with PR lineage and new baseline context.
filipviz added a commit to filipviz/parameter-golf that referenced this pull request Mar 21, 2026
…unner

Port per-head gated attention (12ch, 2*sigmoid) into the PR openai#315
train_gpt.py (11L XSA4 + EMA + Partial RoPE + Late QAT, 1.1248 BPB).
Update run script to use PR openai#315 config for both baseline and experiment.
mrdavtan added a commit to mrdavtan/parameter-golf that referenced this pull request Mar 21, 2026
felipe-parodi added a commit to felipe-parodi/parameter-golf that referenced this pull request Mar 21, 2026
- Rebased train_gpt.py on PR openai#315 (1.1248 BPB SOTA)
- Added SGD TTT and causal TTT variant
- Added gradient-guided adaptive quantization (int5/int6/int7)
- Added z-loss regularization
- Updated plan with current landscape and run commands
@jfprincz jfprincz force-pushed the submission/11l-partialrope-lateqat-1.1248 branch from dfb05a5 to 2951651 on March 21, 2026 21:01
@jfprincz jfprincz changed the title from "Record: 11L Partial RoPE + LN Scale + EMA + Late QAT + XSA4 (val_bpb: 1.1248)" to "Record: 11L Partial RoPE + LN Scale + EMA + XSA4 (val_bpb: 1.1248)" on Mar 21, 2026
saml212 added a commit to saml212/parameter-golf that referenced this pull request Mar 21, 2026
Merged records from all experiment branches into one working branch.
Updated CLAUDE.md with current competitive landscape and next priorities.
Rewrote idea bank with tiered roadmap for closing the gap to openai#315.
felipe-parodi added a commit to felipe-parodi/parameter-golf that referenced this pull request Mar 21, 2026
alia-abbas added a commit to alia-abbas/parameter-golf that referenced this pull request Mar 21, 2026
mrdavtan added a commit to mrdavtan/parameter-golf that referenced this pull request Mar 21, 2026
torch.compile constant-folds CastedLinear._qat at first trace.
Credit: @152334H via PR openai#315.
charmquark1984 added a commit to charmquark1984/parameter-golf that referenced this pull request Mar 21, 2026
13 techniques tested that did NOT work on PR openai#315 base:
- Causal TTT (3 variants): neutral on EMA+XSA base
- MTP: +0.028 BPB, throughput penalty kills it
- INT4: 0.06 BPB quant gap wipes out param advantage
- Canon layers: 48% step overhead not compensated
- Memory tokens, gradient-guided quant, cautious WD,
  L1 regularization, label smoothing, 1M batch, full QAT

4 positive findings:
- EMA > SWA by 0.003 BPB (3-seed verified)
- Weight decay directly controls artifact size
- 786K > 524K batch by 0.004 BPB
- FA3 Hopper: 15-20% more steps at same wallclock

Best verified result: 1.1257 BPB (PR openai#315 reproduction)
Includes 12 training logs for verification.
turazashvili added a commit to turazashvili/parameter-golf that referenced this pull request Mar 22, 2026
Safe config matching PR openai#315 proven techniques:
- 11 layers, MLP 3x (1536), BigramHash 2048
- Muon backend_steps=5, momentum=0.99 (proven by all top PRs)
- XSA on last 4 layers, Partial RoPE 16/64, LN Scale, Late QAT
- EMA decay=0.997 every 4 steps via torch._foreach_lerp_
- CUDA_DEVICE_MAX_CONNECTIONS=1 for multi-GPU overlap
- SmearGate, OrthoInit, int5 MLP/int6 attention, zstd-22
EthanYangTW added a commit to EthanYangTW/parameter-golf that referenced this pull request Mar 22, 2026
…le, EMA, Late QAT, TTT

Major rewrite targeting top-5 leaderboard:
- 11 layers (from 10), BigramHash reduced to 10240 to fit 16MB
- XSA (Exclusive Self-Attention) on last 4 layers
- Partial RoPE: 16/64 head dims get position encoding
- LN Scale: 1/sqrt(layer+1) dampening on deeper layers
- EMA (decay=0.997) replaces SWA
- Late QAT: STE int6 enabled only in final 4% of training
- TTT: 25-epoch SGD on val data post-quantization
- FA3 auto-detection with SDPA fallback
- Reverted SwiGLU back to relu² (confirmed worse by openai#340, openai#344)
mrdavtan added a commit to mrdavtan/parameter-golf that referenced this pull request Mar 23, 2026
… 3 seeds)

AdamW TTT with cosine lr decay over 30 epochs and per-layer lr groups
(3x for MLP output projections, 0.5x for input projections). 34 TTT
configurations tested. FINDINGS.md documents 31 experiments including
negative results on codebook quantization, symmetry-transport, layer
dropping, focal loss, and KL divergence TTT.

Builds on PRs openai#162, openai#180, openai#77, openai#398, openai#442, openai#417, openai#315.
alia-abbas added a commit to alia-abbas/parameter-golf that referenced this pull request Mar 23, 2026
@cocohearts cocohearts merged commit cdabe13 into openai:main Mar 23, 2026
newjordan pushed a commit to newjordan/parameter-golf-1 that referenced this pull request Mar 23, 2026
ROPE_DIMS=16: apply rotary to 25% of head dims, rest position-free
LN_SCALE=1: scale RMSNorm output by 1/sqrt(layer+1)
Both env-var gated, default off — existing runs unaffected.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
nvemuri4649 pushed a commit to thanushpatlolla/parameter-golf that referenced this pull request Mar 27, 2026
…e-lateqat-1.1248

Record: 11L Partial RoPE + LN Scale + EMA + XSA4 (val_bpb: 1.1248)
